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Abstract 

Due to rapid technological advances, a wide range of different measurements can be 
obtained from a given biological sample including single nucleotide polymorphisms, copy 
number variation, gene expression levels, DNA methylation and proteomic profiles. Each 
of these distinct measurements provides the means to characterize a certain aspect of 
biological diversity, and a fundamental problem of broad interest concerns the discovery 
of shared patterns of variation across different data types. Such data types are hetero- 
geneous in the sense that they represent measurements taken at very different scales or 
described by very different data structures. We propose a distance-based statistical test, 
the generalized RV (GRV) test, to assess whether there is a common and non-random 
pattern of variability between paired biological measurements obtained from the same 
random sample. The measurements enter the test through distance measures which can 
be chosen to capture particular aspects of the data. An approximate null distribution 
is proposed to compute p-values in closed-form and without the need to perform costly 
Monte Carlo permutation procedures. Compared to the classical Mantel test for associ- 
ation between distance matrices, the GRV test has been found to be more powerful in a 
number of simulation settings. We also report on an application of the GRV test to detect 
biological pathways in which genetic variability is associated to variation in gene expres- 
sion levels in ovarian cancer samples, and present results obtained from two independent 
cohorts. 

1 Introduction 

A proliferation in the development and application of high-throughput measurement plat- 
forms in biological research has resulted in the increasing availability of multiple levels of 
molecular data available for the same biological sampling units. For a range of human tu- 
mour types, for instance, the Cancer Genome Atlas (TCGA) consortium have made available 
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genome-wide SNP genotypes, DNA copy number, genome-wide gene (and mRNA) expression, 
DNA methylation and proteomic profiles, as well as a range of mutation calls arising from 
deep sequencing. Such comprehensive molecular profiling of biological sampling unit provides 
opportunities for gaining insight into the mechanisms by which different molecular entities 
interact with one another to influence the overall state of the cell(s) in question. This is 
possible through comparing the different types of observations on each biological sampling 
unit, via some means of paired analysis of the individual datasets. A certain degree of com- 
plexity is inherent in this type of analysis since the measurements that are being compared 
are heterogeneous. 

One well-studied example of the task of analysing paired heterogeneous biological data 
is gene expression quantitative trait loci (eQTL) mapping, which seeks to find a ssociations be- 



tween single nucleotide polymorphisms (SNPs) and transcript expression levels (jCookson et al. 



20091 ). A particularly interesting application of multivariate eQTL mapping relates to the dis 



covery of association between biomarkers for ovarian cancer, which is treated study 
in this work. 

The application of paired analysis of heterogeneous biological data types is not limited 
to eQTL mapping, and involves the very common problem of discovering shared patterns of 
variation across different biological measurements. Other areas that have seen considerable 
i nterest include the asso ciation of genetic copy number variation with gene expression levels 



(jBeroukhim et all . 120071) and identification of genes for which DNA methylation is associated 



to .ene expression TEoTrhimo and Hautanieml liB . In these and related studies, the prob- 
lem of comparing heterogeneous data types can be approached using a notion of distance 
between pairs of biological samples. Distances are scalar- valued measures of dissimilarity 
between two input observations, with greater values indicating greater dissimilarity between 
them. For a given application, a suitable semi-metric or metric distance measure can gen- 
erally be defined depending on the nature of the data and on the specific objectives of the 
study. Such distances computed between all pairwise combinations of samples are arranged 
in square, symmetric matrices. These matrices then quantify the variability in the random 
sample according to that particular distance measure, and can be directly compared using an 
appropriate statistical test. 

Distance-based approaches offer a number of advantages compared to direct comparisons 
involving the original measurements. Firstly, distances can overcome limitations of traditional 
multivariate approaches when dealing w ith high-dimens i onal random vectors observed on a 



much smaller number of sampling units ( Shannon et al. . 20021 ). Secondly, distance measures 



can be appropriately defined for data generated under different experimental conditions, and 
are not confined to data types represented only by vectors. A number of distanc e measur es 



are r eadily available for several types of biologica l data , including DNA sequences ( Wu et al 



1997), genetic markers such as SNPs (IGoh et al. 201 1\U gene e xpression data (Priness et al 



2007 ). longitudinal gene expression data ( Minas et al . 201 ll ) and proteins ( Hollich et al 



20051 ). Moreover, for a specific data type, more than one distance measure can be deployed 



to capture different aspects of the data. This can ultimately lead to the discovery of different 
types of associations. 

Only a handful of statistical procedures are currently available to determine whether there 
is an association between paired distance matrices, and they all rely on computationally inten- 
sive procedures for statistical infe rence. The most commonly used and well-known procedure 
is the Mantel test (jMantell . Il967l ). a generalization of Pearson's correla tion. The test has al- 



ready been found useful in many applications arising in bioinformatics ( Shannon et al. . 2002; 
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Beckmann et all 120051 ; ISun et all l201lh . This testing procedure however, has some limita- 



tions. First, when used to compare paired scalar-valued observation s using the Euclidean dis- 
tance, the test exhibits less powe r than a Pearson's correlation test ()Peres-Neto and Jacksonl . 
200ll : iLeeendre and Fortinl . l2Q10h . Secondly, computationally intensive Monte Carlo permu- 



tations are required to assess significance. This procedure, which assumes exchangeability 
of the sampling units under the null hypothesis of no association, introduces sampling er- 
rors which leads to inaccurate estimation of small p- values. This is especially true when too 
few permutations are used, generally betw e en O (T0 3 ) and O(10 5 ) ( Berry and Mielke . 19831 ; 
Mielke and Berrvl . 120071 : iKniinenburg et all l2009h . For example, in order to obtain a permu- 



tation p- value within 10 -5 of the true p- value, it has been shown that O(10 7 ) permutations are 
required, which is computationally infeasible without access to specialised high-performance 
computational resources. This is the case in all situations where thousands or even millions 
of tests are required, such as in eQTL mapping studies in which the entire genome is scanned 
in search of potential associations. In such large-scale applications, it is not uncommon to 
settle for a much smaller number of Monte C arlo permutations per test , which in turn can 
lead to inflated family-wise type I error rates (IPhipson and SmvthLlioioh . 

In this article we propose a novel test, the generalized RV (GRV) test, to detect association 
between paired distance matrices. This test is derived as a generalization of the classical 
mu ltivariate RV te st of no correlation between paired random vectors, originally proposed 
bv lEscoufierj (Il973h . While inference can be performed via computationally intensive Monte 
Carlo permutations, we also derive an approximation to the exact null distribution which 
would be obtained by enumerating all permutations. Using the proposed null distribution, 
approximate p-values can be estimated in closed-form without the need for computationally 
intensive procedures. The proposed GRV test is introduced in Section [2j In Section [3] we 
describe power studies showing that the GRV test can offer greater statistical power than 
that achieved by the Mantel test even when many millions of Monte Carlo permutations 
are performed. Section H] presents an application of the GRV test in cancer research, where 
the interest lies in detecting biological pathways characterized by a non-random association 
between genetic and gene expression sample variability. Concluding remarks are presented in 
Section [5j 



2 The generalized RV test 
2.1 Problem statement 

Assume we have observed paired measurements {{xi,yi)}^L 1 on N independent biological 
sampling units. Each one of the {xi}^L l and measurement sets is represented by a 

particular data structure, such as continuous or discrete vectors, although the data do not 
necessarily need to be represented as vectors. For instance, the measurements might be struc- 
tured as graphs or trees (e.g. representing biological networks or ontologies). We also suppose 
that suitably chosen semi-metric or metric distances d x and d y have been identified that cap- 
ture the dissimilarity between any sample pair {(xj, Xj)}i^j and {(yi, yj)}i^j, respectively. On 
evaluating the distances for all possible pairs, we obtain the paired N x N distance matrices 
A x = {d x (xi,Xj)}f j=1 and A y = {d y (yi,yj)}f j=1 . 

Using the random sample, evidence of non-random association between these paired ge- 
nomic measurements can be assessed by testing the null hypothesis that A. x 7^ aA y , for 
some positive constant a. This constant represents possible scaling differences between the 
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elements of each distance matrix, which may arise because of the chosen distance measures. 
Under this null hypothesis, the relationship between the elements in A x is not maintained by 
the corresponding elements in A^. The alternative hypothesis is that the distance matrices 
are equal up to a constant, in which case every distance d x (xi,Xj) is linearly related to the 
paired distance d y (y, , y,) . We are particularly concerned with settings where inferences must 
be drawn for a very large number of such tests simultaneously, hence the computational cost 
in obtaining a p-value for each test should be kept as low as possible. 



2.2 The proposed test statistic 

When the observations are vectorial, the mean-centered vectors {xi}f =1 £ M. p and {yi}fLi G 
can be arranged in paired data matrices, X and Y, of dimensions N x P and N x Q, 
respectively. In this case, in order to establish whether the P and Q variables are correlated, 
the null hypothesis that Jl xy = is commonly tested, where X! xy is the P x Q covariance 
matrix containing the true covariances between the P variables observed in {xi}f =1 and Q 
variables observed in {yi}f = i- The RV st atistic has been proposed for this task and arises as 
a generalization of Pearson's correlation ( Escoufier . 19731 ). As with Pearson's correlation, RV 



is computed as the ratio of covariance to the square-rooted product of the variances, 

N tr (X T YY T X) 
RV(X,Y) = —^-— > (1) 



where tr(-) is the trace function and 1 1 A\ \ = y tr(A T A) denotes the Frobenius norm for matrix 
A. The RV statistic ranges between and 1, with no association when X T Y = 0, and perfect 
association when there exists a linear mapping which relates every P-dimensional observation 
in X to every Q-dimensional observation in Y. Thus larger values provide evidence against 
the null hypothesis. When P = Q = 1, RV(X ,Y) = cor 2 (X,l^), where cor(-,-) denotes 
Pearson's correlation coefficient. 

We first observe that the RV test can be interpreted in terms of the Euclidean distance ma- 
trices A. x and arising by applying the Euclidean distance measure to {xi]f =l and {yi]f = i, 
respectively. This is because, due to properties of the trace operator, tr{X T YY T ' X) = 
ti(XX T YY), ||X T X|| = ||XX T || and \\Y T Y\\ = UlT^H, so that RV statistic Q can be 
written in equivalent form as 

N tr (XX T YY T ) 
RV(X,Y)= ||x V||||yyT|| - ( 2 ) 

Note that ([1]) differs from ([2]) in that emphasis is placed on the two symmetric N x N matrices 
XX T and YY T , instead of the four covariance matrices X T Y e M PxQ , Y T X £ M. QxP , 
X T X <E M PxP and Y T Y £ M« x( 2. In settings where P,Q > N, which are common when 
comparing biological measurements, it is computationally convenient to use ([2]) over (pQ). 

In addition, the N x N matrices XX T and YY T can each be written in terms of the 
Euclidean distance matrices A. x and A. y , respectively. This is due to the results that 

XX T = -\CA 2 X C and YY T = -\cAlC, 

2 2 y 

where C = (In — Jn/N) is the symmetric N x N centering matrix where In is the identity 
matrix of size N and Jn is the square matrix of ones of size N (|Gowerl . ll96d : IBorg and Groenenl . 
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20051 ). Also, —\Cb? x C = G x , where G x is the square, symmetric and real- valued Gower's 



centered inner product matrix, and similarly for G y , so that tr (XX T YY T ^j = tr (G x G y ), 
\\XX T \\ = 1 1 Gal | and Hl^l^H = HGyH. Thus RV is completely specified by the elements of 
the Euclidean distance matrices. 

Based on these observations, we propose a generalization for the RV statistic that is 
entirely specified by any pair of distance matrices, not necessarily Euclidean, in order to test 
the null hypothesis that A x ^ aA. y . The generalized RV (GRV) test statistic is defined as 

GRV(G„G,)= (3) 

noting the implicit assumption \ \G X \ \ \ \G y \ \ > which is always satisfied in practice; | \G X \ \ > 
since G x contains real- valued elements which are not all zero, and similarly for G y . 

It is insightful to understand in which ways the proposed GRV statistic is similar to, 
and differs from, the classical Mantel statistic. The Mantel statistic is computed as the 
correlation between the upper-triangular distances of each distance matrix. Denote by v x 
the column vector containing the A = N(N — 1) /2 upper-triangular distances {d x (xi, Xj)}i>j, 
and similarly for v y . The Mantel statistic is then computed as ^/(A^, A y ) = cor (v x , v y ), 
i.e., 

r M (A A)- 1 ^2 f dx ^ Xi,Xj ^ - x \( d v^ y ^ ~ y S 

where x = ^«>j d x (xi, Xj) / A, s x = ^2i > j{d x (xi, Xj) — x) 2 /{A — 1), and similarly for y and s y . 
The statistic therefore standardizes the upper-triangular elements of each distance matrix to 
have zero mean and unit variance. In this way the distances between paired distance matrices 
can be directly compared. 

It can be shown that, like Mantel, GRV is also a correlation coefficient. In particular, it can 
be proved that GRV is equal to the correlation between the vectorized matrices G^/HG^I and 
G y /||G y ||, i.e., the ^-dimensional vectors g x and g y with elements {g x (xi,Xj)/\\G x \\}fj =1 
and {g y {yi,yj)/\\Gy\\}^ j=1 , respectively. That is, GKV(G x ,G y ) = cor(g x ,g y ) (see Appendix 
lAl for a proof). While both Mantel and GRV are correlation coefficients, the difference 
between them lies in the methods of standardization applied to the distances in each case. 
In Mantel, the upper-triangular distances are subjected to a classical standardization, where 
their mean is subtracted and they are divided by their standard deviation. In GRV, however, 
all distance elements are considered, and they are squared, double-centered and normalized 
by dividing by their Frobenius norm. This difference leads to greater power of the GRV test to 
detect association between paired distance matrices than the Mantel test, as demonstrated in 
Section [3j Further properties of the GRV test statistic are provided in Appendix |Bj including 
a discussion of how it overcomes the limitation of Mantel for scalar- valued observations and 
Euclidean distances. 



2.3 Approximate null distribution 

Under the null hypothesis, the sampling distribution of GRV is generally unknown. This is 
because the quantity T = ti(G x G y ) in the numerator is completely specified by the elements 
of A x and A y , which have unknown distributions and are correlated. The null sampling 
distribution can be generated by using permutations of one of the centered inner product 
matrices, G y , say. For each of N n permutations tt £ II, which are one-to-one mappings of the 
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Table 1: Power of the GRV and Mantel tests at a significance level of 0.1% for N = {30, 50, 70, 90, 100} and all combinations of the genetic (gen) 
idcntity-by-state (IBS), Sokal and Sneath (SS), and Rogers and Tanimoto I (RTI) distances and gene expression (gex) Euclidean (Euc), Manhattan 
(Man) and Mahalanobis (Mah) distances. For the GRV test, the standard deviation of the power estimate is given in brackets. For the Mantel 
test, the confidence interval obtained with 95% coverage probability is given in square brackets. The average number of Monte Carlo permutations 
required to achieve the power estimates for Mantel are stated below the confidence interval in millions. 



Gen 


Gex 


Test 












N 










dist 


dist 






30 




50 




70 




90 




100 


IBS 


Euc 


GRV 


0.126 


(0.042) 


0.351 


(0.064) 


0.566 


(0.074) 


0.712 


(0.049) 


0.780 


(0.051) 






Mantel 


0.081 


[0.042,0.126] 
3.081 


0.116 


[0.056,0.183] 
1.575 


0.225 


[0.145,0.291] 
9.921 


0.275 


[0.228, 0.325] 
13.51 


0.322 


[0.272, 0.374] 
19.61 




Man 


GRV 


0.139 


(0.049) 


0.375 


(0.068) 


0.590 


(0.066) 


0.730 


(0.059) 


0.772 


(0.065) 






Mantel 


0.069 


[0.023,0.120] 
1.859 


0.151 


[0.091,0.226] 
2.134 


0.240 


[0.173,0.306] 
6.290 


0.302 


[0.238,0.355] 
13.80 


0.378 


[0.314, 0.444] 
8.794 




Mah 


GRV 


0.394 


(0.061) 


0.846 


(0.047) 


0.952 


(0.030) 


0.978 


(0.020) 


0.990 


(0.014) 






Mantel 


0.100 


[0.049,0.169] 
2.818 


0.311 


[0.262,0.355] 
28.53 


0.542 


[0.513, 0.572] 
135.9 


0.715 


[0.694, 0.735] 
424.2 


0.776 


[0.761,0.789] 
666.8 


SS 


Euc 


GRV 


0.038 


(0.029) 


0.157 


(0.042) 


0.299 


(0.067) 


0.470 


(0.064) 


0.556 


(0.063) 






Mantel 


0.035 


[0.006, 0.064] 
0.971 


0.111 


[0.069,0.154] 
5.318 


0.180 


[0.112,0.245] 
3.032 


0.268 


[0.202,0.329] 
7.445 


0.319 


[0.254,0.381] 
10.69 




Man 


GRV 


0.045 


(0.028) 


0.160 


(0.046) 


0.308 


(0.068) 


0.503 


(0.073) 


0.556 


(0.068) 






Mantel 


0.047 


[0.018,0.073] 
2.879 


0.121 


[0.080,0.171] 
10.27 


0.195 


[0.133,0.269] 
4.371 


0.310 


[0.242,0.387] 
7.325 


0.363 


[0.290, 0.424] 
9.232 




Mah 


GRV 


0.092 


(0.041) 


0.481 


(0.074) 


0.784 


(0.059) 


0.907 


(0.037) 


0.927 


(0.034) 






Mantel 


0.072 


[0.030,0.112] 
5.567 


0.266 


[0.194,0.340] 
8.635 


0.510 


[0.448, 0.564] 
21.92 


0.652 


[0.613,0.687] 
74.12 


0.723 


[0.689, 0.756] 
40.01 


RTI 


Euc 


GRV 


0.044 


(0.027) 


0.159 


(0.047) 


0.340 


(0.062) 


0.551 


(0.067) 


0.654 


(0.063) 






Mantel 


0.032 


[0.010,0.065] 
1.327 


0.106 


[0.061,0.154] 
5.141 


0.205 


[0.142,0.265] 
8.112 


0.302 


[0.233, 0.366] 
1.421 


0.339 


[0.272, 0.397] 
16.36 




Man 


GRV 


0.055 


(0.035) 


0.172 


(0.052) 


0.363 


(0.067) 


0.540 


(0.076) 


0.639 


(0.072) 






Mantel 


0.051 


[0.019,0.089] 
1.373 


0.124 


[0.067,0.171] 
4.512 


0.216 


[0.154,0.288] 
4.371 


0.360 


[0.282,0.432] 
9.289 


0.401 


[0.321,0.465] 
7.233 




Mah 


GRV 


0.173 


(0.055) 


0.675 


(0.068) 


0.925 


(0.038) 


0.987 


(0.015) 


0.993 


(0.010) 






Mantel 


0.062 


[0.011,0.119] 
0.8066 


0.241 


[0.180,0.314] 
11.89 


0.515 


[0.478, 0.554] 
151.0 


0.688 


[0.672, 0.702] 
654.4 


0.765 


[0.755,0.776] 
1068 



set {1, . . . , N} to itself, the rows and columns of G y are simultaneously permuted by it and 
denoted G Vt7T . This generates the set {GRV(G X , G 2/]7r )} 7rg n and the p-value of an observed 
GRV statistic, GRV(G Z , G y ), can be obtained as the proportion of permuted statistic values 
greater than or equal to GKV(G X , Gy). This is a right-tailed test as larger GRV values provide 
evidence against the null. 

In order to estimate the p-value without permutations, we adopt a moment matching 
approach where the exact null distribution which would be obtained if all N\ permutations 
were used is approximated by a continuous distribution. In particular, we approximate the 
null distribution by the Pearson type III distri bution, which has been shown to capture sk ewed 
characteristics of null sampling distributions (jMielke and Berrvl . 120071 ; iJosse et all 120081 ) . To 
use this distribution we require the first few moments of the exact permutation distribution 



of T; the mean is /i 



W Syren tire variance is a 2 = X^erT ^7? ~~ ^ an d the skewness is 



7 



where T n = tr(G x Gj /j7r ) and II contains all N\ permutations. No permutations are needed 
to compute |m, c, 7), since closed-form expres sions of these quantities are retrievable via the 



analytical results of iKazi-Aoual et al 



(| 19951 ). These results require that G x and G y are 



square, symmetric and centered; properties satisfied by definition. 

On obtaining closed-form expressions for the mean, variance and skewness of the exact 
permutation distribution of T, we standardize T by subtracting fi and dividing by a. The 
resulting T s = (T — fj)/a is then assumed to have the Pearson type III distribution under 
the n ull with probability density function (PDF) denoted /t„(£;7); see iMielke and Berry 
(120071 ) for the full definition of the distribution. We denote the CDF of T s by J^r,{t;i). The 
approximate null distribution of GTW(G x ,G y ) = T/\\G x \\\\G y \\ can then be derived by a 
simple transformation of the distribution of T s . The CDF of GRV, .Fgrv^; 7)) is given by 



t— / *^ I I ^ ' X I I I I X! I 

>grv(^;7) = >t s ' 



a 



and this is a valid CDF since J-t s (; 7) is a valid CDF. The p-value of an observed GRV statistic 
GSV(G X , G y ) is then estimated by 1 - J c " G rv(GRV(G ! x , G y );j). The PDF f G Rv(x, 7) is found 
by differentiation as 



/grv(z,7) 



\G X \\ \\G % 



a 



x G x G h 



a 



-;7 



since 



|G r ||||G,,|| > and a > 0. 



3 Power Studies 

In this section we assess the power of the proposed GRV test and compare it to the classical 
Mantel test. A Monte Carlo procedure is used to estimate the power of the GRV test, and 
in order to estimate the power of t he Mantel test we use the recently proposed algorithm of 



Gandv and Rubin-Delanchvi (|2013l ). This algorithm estimates the power of a Monte Carlo 



testing procedure and states a confidence interval around this estimate boasting a guaranteed 
coverage probability. It also states the average number of Monte Carlo permutations required 
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to achieve the given power estimate. It requires specification of the maximum length of the 
resulting confidence interval and the desired coverage probability, and then runs as many 
permutations as required to yield a power estimate with a confidence interval of length no 
greater than specified. Using this approach we are able to demonstrate that, at least for the 
experimental setups considered, many millions of permutations are needed for the Mantel 
test to achieve power of the same precision as the GRV test, and that the GRV test achieves 
greater power. 

We use simulated data that mimics the experimental data of a typical eQTL application 
where transcriptional measurements are paired with SNPs. We first generate an N x P matrix 
X containing N simulated SNPs (i.e., minor allele counts), denoted Xi = (xn, . . . , Xip) T 
for i = 1,...,N, with varying minor allele frequencies (MAFs) at each of the P SNPs. 
The MAF of SNP p, m p , is first simulated from a uniform distribution ?7(0.1,0.5), and the 
allele count is then simulated from a multinomial distribution with probabilities (1 — m p ) 2 , 
2m p (l—m p ), and m p of observing 0, 1 and 2 minor alleles, respectively. The paired data matrix 
Y = (yi, . . . , Un) T is then generated as follows. The N x 1 vector z = Xlp = {z\, . . . , z^) T 
containing the row sums of X , i.e., the minor allele counts across the P SNPs, is computed, 
and yi = ZilQ + e* for i = 1, . . . , N, where e; ~ Nq(/j,, £), with fi = {p,\, . . . , ^q) t and 
fiq ~ £7(0, 1) for q = 1, . . . ,Q, and £ a random Q x Q Wishart matrix. We then consider 
the widely-used identity-by-state (IBS) distance for the simulated genetic markers in X, 
which measures the degree to which alleles are shared across the SNPs (see, for instance, 



Goh et all (I201ir0. in a dditio n to the Sokal and Sneath, and Rogers and Tanimoto I distances 



(jSelinski and Ickstadtl . l2005h . For the observations in Y we use the Euclidean, Manhattan 
and Mahalanobis distances. See Appendix ICl for details on these distances. 

For P = 2, Q = 10, and each of N = {30, 50, 70, 90, 100}, the power of the GRV test is 
computed using 50 Monte Carlo runs, and each time generating 50 paired datasets as above. 
For each genetic and gene expression distance, the GRV test is applied and the proportion 
of p-values less than or equal to the significance level of 0.1% is recorded. The mean power 
across all 50 Monte Carlo runs for each distance combination and value of N is reported in 
Table [TJ in addition to the standard deviation of these estimates. As expected, the power 
increases with N. 

We then estimate the power of the Mantel test for each of the above settings using the 
algorithm of Gandy and Rubin-Delanchy ( 20131 ). For each setting we bound the confidence 
interval length by twice the standard deviation of the corresponding estimate for the GRV 
test. That is, we consider one standard deviation on either side of the power estimate as an 
empirical indication of the precision achieved by the GRV test. We monitor the number of 
Monte Carlo permutations required to obtain power estimates with such confidence intervals 
with a coverage probability of 95%. These results are also given in Table 1, where the power 
is stated alongside the confidence interval, and the number of Monte Carlo permutations 
required is stated on a separate line below the confidence interval. 

In all settings the GRV test achieves greater power than the Mantel test, even when using 
large numbers of Monte Carlo permutations. Note that while the Mantel power estimates 
improve with N, as expected, the number of Monte Carlo permutations varies between O(10 5 ) 
to O(10 9 ) non-linearly with N. This is an artefact of the algorithm; the expected number 
of permutations depends on the le ngth of the confidence interval sou ght and the region of 
the confidence interval in [0, 1] (see lGandy and Rubin-Delanchyl (120131 ) for more details). An 
example of how the standardization used by GRV yields greater power than that used by 
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Mantel is presented in Appendix [Dl in addition to simulation results demonstrating that the 
GRV test attains the nominal significance level of a given test. 



4 Application to ovarian cancer 

Around 225, 000 women were estimated to have developed ovarian cancer in 2008, of which 
over 80% are Epithelial Ovarian Cancer (EOC). In the UK EOC has particularly poor prog- 
nosis with only 34% of patients surviving 5 years after diagnosis. 

In many cancers, the ability to measure some biomarker that can predict the likely re- 
sponse to a chemotherapeutic agent could prevent patients unnecessarily suffering side-effects 
from ineffective treatments, and may suggest the best course of action where multiple treat- 
ments are available. Unfortunately, while high-throughput gene ex pression profiling has been 
succe ssful in identifying prognostic 'signatures' for many cancers ( van't Veer and Bernards! . 
20081 ) , solid tumour gene expression profiles have practical drawbacks preventing their use as 
prognostic or predictive biomarkers in the clinic, especia lly in terms of obtaining mRNA from 
the tumour and the transitive nature of gene expression ( Sawyers] . 2008 ). Howe yer, circulatin g 
tumour DNA can be isolated from serum samples in ovarian cancer patients (jHickeyi . 
which reflect an attractive source for genetic biomarkers both because of the relative ease of 
accessibility of the patient material and the long-term stability of the genetic sequence. Addi- 
tionally, treating biological pathways as multidimensiona l biomarkers may provide substantial 
improvements in terms of robustness over single agents ( Dai et al. . 201 ll ). 



We sought to discover if it might be possible to use genetic biomarkers as surrogates for the 
transcriptional activation of pathways. Such pathway-based genetic biomarkers may suggest 
starting points for the development of clinical diagnostic tests to predict the outcome of a 
therapeutic exploiting a related pathway. 

In order to identify pathways for which the profile of SNP genotypes across genes in the 
pathway closely reflects the state of expression of the genes in the pathway, we applied the 
GRV test to evaluate the significance of the degree of similarity in distance matrices derived 
from SNP genotypes and gene expression levels from a large set of selected pathways, for two 
independent cohorts. 



4.1 Datasets 

For this study we use two independent cohorts of ovarian cancer patients: one collected by the 
TCGA consortium, and an independent cohort for validation. For the TCGA cohort, raw data 
CEL files from Affymetrix SNP 6.0 and from Affymetrix HT-HGU133A microarray profiling 
499 primary high grade serous ovarian tumours were obtained from the TCGA web reposi- 
tory. For the Affymetrix SN P 6.0 arrays, no r maliz ation and genotype calling were performed 



using the CRLMM method (jCarvalho et all 120071 ) . Current annotations were obtained from 



Affymetrix's NetAffx Analysis Center and SNPs were annotated with their associated Gene 
Symbols. Once SNP probes were mapped to Gene Symbols, lists of SNPs corresponding to 
each CPDB pathway were obtained by selecting all SNP probes mapping to Gene Symbols 
that were included in the pathway's list of hgnc_symbol_ids as downloaded from Consensus- 



PathDB (CPDB) (IKamburov et all \2Q11\ ). For the A ffymetrix HT-HGU1 33A arrays, data 



normalization was performed using the RMA method ( Irizarrv et al. . 20031 ). Probe-set map- 
ping to Gene Symbols was performed using the hthgul33a.db package from Bioconductor. 
Once probe-sets were mapped to Gene Symbols, lists of probe-sets corresponding to each 
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CPDB pathway were obtained by selecting all probe-sets mapping to Gene Symbols that 
were included in the pathway's list of hgnc_symbol_ids as downloaded from CPDB. 

As an independent validation data set, raw data CEL files from both Affymetrix SNP 
6.0 and Affymetrix HGU133plus2 microarrays profiling 60 primary high grade serous ovarian 
tumours were obtained from the Hammersmith Biobank Resource Centre. The datasets were 
provided courtesy of the Genome Institute of Singapore. Raw data for this independent 
tumour cohort was processed identically to those obtained from TCGA, although mapping of 
gene expression probe-sets to Gene Symbols was performed using the hgul33plus2.db package 
from Bioconductor. The tables of genotype calls and normalized gene expression values, 
mapped to pathways which were used in this analysis, are provided in Supplementary Tables 
SI and S2, respectively. 

We wanted to reduce the possibility that any identified associations simply reflected 
genome-wide similarity structures between the patients (e.g. due to molecular subtypes of 
disease) or involved so few measurements that they weren't indicative of the entire pathway. 
From the full list of 4, 119 CPDB pathways we filtered out all those with less than 5 or more 
than 200 probe sets in either gene expression or SNP genotype array datasets. A total of 479 
pathways remained after this filtering, and these pathways are listed in Supplementary Table 
S3 along with their mappings to probes in each of the two datasets. 



4.2 Experimental results 

Since different distance measures highlight different re l ationships among genetic markers and 
features of gene expression data ( Priness et al . 2007 : Song et al . 20121 ). we sought to per- 



form a meta-analysis to combine the results of similarity testing across a range of distance 
measures for both the genotypes and gene expression values. The rationale being that if a 
given pathway shows similar distance matrices at the genotype and gene expression level for a 
number of different distance measures, the observed similarity is unlikely to be due to a quirk 
of one particular distance measure. For each of the 479 pathways, sample-wise distance matri- 
ces were calculated using 8 different distance measures (Euclidean, Mahalanobis, Manhattan, 
Maximum, Bray-Curtis, Pearson's correlation, Spearman's correlation and Cosine angle dis- 
tances) for the gene expression data and 5 distance measures (IBS, SS, RTI, Simple Matching 
and Hamman I) for the SNP genotypes; see Appendix O for details on these distances. The 
association between the paired distance matrices, one derived from the genotypes of SNPs 
mapping to genes in the pathway and the other derived from the expression levels of genes in 
the pathway, was evaluated using the GRV test. This resulted in 40 individual p- values for 
each pathway, one for each combination of distance measures used. An illustration of how 
the the null distribution compares with the permutation distribution when applied to this 
cancer data is presented in Appendix [El The 40 individual p-values for each pathway were 
then combined using the 'maximum' method as implemented in the R package MetaDE. The 
combined significance estimates from running this meta-analysis for each pathway are given 
in Supplementary Table S4. This meta-analysis was then repeated for the validation cohort. 

To obtain a perspective of the robustness/repeatability of the associations detected be- 
tween genotype and gene expression at the pathway level across the two independent cohorts, 
we compared the ranked lists of pathways arising from the meta-analysis of each cohort. 
The degree of agre ement between the t wo ranked lists was evaluated using the normalized 
Canberra distance ( Jurman et al. , 20081 ). The general principle is to quantify the distance 



between the rankings of the top k terms of two ranked lists, with each term's influence on the 
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overall distance weighted by its position in its respective list. By evaluating the normalized 
Canberra distance for the top-ranked k pathways, the relationship between overlap and rank- 
ing can be explored: smaller values indicate a lesser discrepancy in rankings among the top 
k terms and therefore more similar rankings in the two lists. Figure Q] shows the normalized 
Canberra distance plotted against k. 




k 



Figure 1: Plot of the normalized Canberra distance against k for the top k elements in two lists of 
pathways. Each pathway list is ranked according to meta-analysis estimate of significance of association 
between genotype and gene expression over 40 combinations of distance measures, one ranked list per 
cohort. The dotted line represents the average normalized Canberra distance between the TCGA 
data-derived pathway list and each of 5, 000 random permutations of the ranks in the second list. 



It can be seen that the overlap of the meta-analysis results from the two cohorts is greatest 
for small k, implying that the highest-ranking pathways in each list are more similar than 
the lower-ranking pathways. Furthermore, for each k we used 5, 000 permutations of the 
ranks in the second list to obtain a p-value estimate for the normalized Canberra distance 
values observed between the two rankings. These p-values across the range of k were less 
than 0.005 after false discovery rate multiple-testing adjustment, reflecting the fact that the 
overlap between the two ranked lists of pathways was significantly greater than that expected 
by chance. These results suggest that the association between distance matrices based on 
the levels of expression of genes within certain pathways and distance matrices based on 
SNP genotypes associated with genes in the corresponding pathways is robust and reliably 
evaluated by the GRV test. The meta-analysis of the GRV test result across multiple distance 
measures demonstrates that these associations are not a result of peculiarities of particular 
distance measures, and certain pathways are shown to exhibit a very strong association in 
both cohorts. 

A particularly striking result of this study is the CPDB pathway '5-aminoimidazole ri- 
bonucleotide biosynthesis', which is ranked #1 in the meta-analysis of GRV test results from 
the TCGA dataset and ranked #2 in the meta-analysis of GRV test results from the validation 
dataset. The 5-aminoimidazole ribonucleotide is a precursor to de novo purine biosynthesis, 
which is essential for rapidy proliferating tissue (jFirestine and Davissonl . 1 19931 ). Pharmaco- 
logical inhibition of de novo purine biosynthesis has proved to be a successful strategy in the 
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clinic for treating cancers and inflammatory disorders ( Christopherson et al. . 20021 ). It has 



been shown through a clinical trial that genetic polymorphisms could be biomarkers to pre- 
dict response to methotrex ate therapy, which inh ibits de novo purine biosynthesis, in patients 
with rheumatoid arthritis ( Dervieux et al. . 20041 ) . The 'methotrexate pathway' itself showed 
significant genotype-expression association in the TCGA cohort (ranked #17), but not in the 
validation cohort (ranked #316). The '5-aminoimidazole ribonucleotide biosynthesis' path- 
way may therefore suggest additional or alternative SNPs to use as genetic biomarkers for 
predicting outcome of inhibiting de novo purine biosynthesis. 

Another result of note is that the CPDB pathway Platinum Pathway, Pharmacokinet- 
ics/Pharmacodynamics yields highly significantly similar tumour- wise distance matrices across 
all 40 applications of the GRV test on this paired ovarian cancer data, for both cohorts inves- 
tigated (ranked #14 from the TCGA dataset and #42 from the validation dataset). This is 
particularly interesting because platinum-based chemotherapy is the standard treatment for 
ovarian cancer, but unlike for many other solid cancers, survival rates in ova rian cancer have 
not i mproved noticeably in the 30 years since this treatment was introduced ( Vaughan et all , 
201 ih . The principal reason for this is the h igh frequency of pla t inum - resistant relapse , whic h 



may occur through multiple mechanisms ^arwal and Kavel . » Istronach et all B. 
Therefore, being able to predict the response to the platinum-based DNA damaging agents 
could be of great clinical value in the treatment of ovarian cancer. Additionally, all genotype- 
derived and expression-derived distance matrices for the Base Excision Repair pathway are 
highly significantly similar in the TCGA cohort. This is of interest as the Base Excision 
Repair pathway is involved in repairing platinum-induced DNA damage. In fact, effective 
inhibition of this pathway (via PARP) in patients wit h BRCAl/2 mutations is associated 
with sensitivity to platinum chemotherapy ( Fong et al , 2010l ). 



5 Conclusion 

In this work we have presented the GRV test as a novel procedure to detect association 
between paired distance matrices, which is applicable in any setting where suitable distance 
measures can be defined. Similarly to the widely used Mantel test, the GRV test can be 
directly applied whether the distance measures are metric or semi-metric, but overcomes the 
limitation of the Mantel test that, where paired data is vector- valued, a hypothesis of no 
correlation between the vectors of interest can be tested. As a further advantage of the GRV 
test over existing distance-based tests, an approximate null distribution is proposed such 
that inference can be performed without expensive Monte Carlo permutation procedures. 
Through extensive simulations we have demonstrated that the GRV test achieves greater 
power than the popular Mantel test. This greater power is exhibited even when many millions 
of permutations are used for the Mantel test, thus demonstrating its suitability as a tool for 
use in multiple-testing settings were many tests need to be simultaneously performed. 

The GRV test was applied to paired ovarian cancer datasets in which genome-wide pro- 
files of SNP genotypes and gene expression levels were available for each patient across two 
independent cohorts (N = 499 and N = 60). The paired datasets were split into subsets 
in which the features all mapped to the same pathway (using mappings from the Consensus 
Pathway DB), so that each GRV test result would indicate the statistical significance of the 
similarity between the (tumour-wise) distance matrix derived from the genotype data and the 
distance matrix derived from the gene expression data. For each cohort a meta-analysis was 
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performed across the GRV test results from 40 different combinations of distance measures, 
and the pathways were ranked according to their significance across all distance measures. 
Permutation-based estimates of the significance of the overlap between the ranked lists of 
pathways from the two independent cohorts revealed a highly significant agreement in the 
two rankings, indicating a degree of robustness in the results of our analysis. 

This application of the GRV test has demonstrated that the transcriptional activity of a 
number of pathways seem to be reliably predicted by sequencing a limited set of genetic mark- 
ers that could be detected in circulating tumour DNA obtained from patients' serum. This 
relationship was particularly pronounced for the pathway '5-aminoimidazole ribonucleotide 
biosynthesis', for which the genotypes and the gene expression patterns were strikingly as- 
sociated in two independent cohorts of patients. The association between genotypes and 
pathway-level gene expression profile is of particular interest in the case of the platinum re- 
sponse pathway, which is clearly clinically important in EOC given that the only standard 
treatment involves platinum-based chemotherapy. Following further investigation, these ob- 
servations could potentially lead to development of non-invasive biomarkers that could predict 
patient response to front-line chemotherapy, and serve to stratify patients for selection into 
clinical trials involving alternative treatments. 

As individual distance measures can yield different results in the analysis of associations 
across gene expression patterns, unless a single distance measure is selected a priori as the 
most appropriate for each dataset, combining results of multiple tests of association using 
different distance measures may provide more reliable results than a single test. A key ad- 
vantage of the GRV test in our application to multivariate eQTL mapping of pathways in 
ovarian cancer is that it enables a meta-analysis of the associations between genetic distance 
and gene expression distance for different combinations of distance measures. Having a test 
for these associations that does not require expensive permutations enables a fast estimation 
of the robustness of the genotype-expression pathway level associations against quirks of par- 
ticular definitions of distance. Given the robust associations discovered in this analysis, we 
suggest that it may be possible to predict the mode of activation of a pathway (in terms of 
which members are transcriptionally activated) in cancer patients, based on the genotype of 
the tumours' DNA across a limited panel of SNPs. 
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A GRV statistic is a correlation coefficient 

Consider the quantity cor(g x , g y ). To compute this we require the mean and standard devi- 
ation of the values in g x and g y . The means are since G x and G y are centered matrices. 
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The standard deviation of the elements in g x is given by 
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and similarly for g y . Thus the correlation of interest is given by 
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as required. 



B Properties of the GRV statistic 

The interpretation of GRV as a correlation coefficient indicates that it may range between 
— 1 and 1, with negative values indicating association in the form of a linear correlation of 
different sign (as with Pearson's correlation and Mantel). In this section we discuss how to 
interpret the values of GRV. 

To begin, we show that the GRV statistic is related to the Frobenius distance between 
G x and G y . Recall that the Frobenius distance between two matrices A and B of equal 
dimensions is defined by cLf(A, B) = y/ti ((A — B) T (A — B)). It is easily shown that the 
Frobenius distance between the scale invariant configurations G^/HG^I and Gy/HGyH is 
related to the GRV statistic by 

dF ( nl^j, j|§Ji| ) = ^2(1-GRV(G„G J/ )). (4) 

When d F (G x /\\G x \\,G y /\\G y \\) 
GRV{G x ,G y ) = 1, i.e., when 



= 0, observe that perfect association is achieved when 

||G;r|| ||GJ| 
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This equality, however, can only be attained if G x and G y are both positive semi-definite 
(having non- negative diagonals), or both indefinite (having non-negative and negative values 
on the diagonals). These occur if d x and d y are both metric, or semi-metric, respectively. 
When one distance function is metric and the other is semi-metric, perfect association cannot 
be attained, as the diagonals of G x and G y cannot be equal. In addition, negative values are 
attained for GRV for certain combinations of these metric properties. 

To see this, consider the upper and lower bounds of the GRV statistic, found as follows. 
Let the ordered eigenvalues of G x and G y be denoted {^x,i}^=i and {A,,.;)^ , respe ctively, 
and consider the bounds of the quantity tr (G x G y ). Using the result of Lasserre! (jl995l ). which 



gives bounds for the trace of the product of two Hermitian matrices (square, complex- valued, 
and equal to their conjugate transpose), these are given by 

N N 

N-i+1 — tr(G x G y ) < \c,i\/,i- 
i=l i=l 

The bounds for GRV are then given by 

< GRV(Ga;, Gy) < , 

ll tT :r|Ml t - r 3/ll ll < - r a;MM ljr i/ll 

since 1 1 G x \\\\ G y \ \ > 0. The numerators of the bounds of GRV are sums of eigenvalue products, 
and so may be non-negative or negative depending on the sign of the eigenvalues. This in 
turn depends on the distance functions satisfying the metric property; they are either both 
metric or both semi-metric, or one is metric and the other is semi-metric. Note that greater 
Frobenius distance values occur with lower GRV values which may be negative, and so the 
paired distance matrices are considered to be less associated. 

B.l Both metric or semi-metric distances 

When the distances are both metric, G x and G y are positive semi-defi nite and the ordered 
eigenvalues {A^j}^ and {X y ^}fL l are non-negative ( Krzanowski . 200d ) . The summation in 
the lower bound contains the terms {\ X: i\ y: N-i+i}iL\, which are therefore non-negative, so 
that GRV is non-negative. The minimum value of indicating no association is attained 
when tr (G x G y ) = 0. This can be interpreted in terms of the N x N principal coordinate 
matrices arising from A x and A^, denoted X and Y, respectively. These matrices are found 
by classical multidimensional scaling (MDS) such that the rows are ^-dimensional represen- 
tations of the original observations in Euclidean space, and t heir pairwise Eucl i dean distance s 
equal the corresponding pairwise distances in A. x and A. y ( Torgerson . 19521 ; Gower . 19661 ). 



It then follows that G x = XX T and G y = YY T , so that tr (G x G y ) = tv(XX T YY T ) = 
=^ X T Y = 0, i.e., the principal coordinate matrices are orthogonal. The maximum value 
GRV can take is 1, since G x and G y have positive diagonal elements so that equality ([5]) can 
be attained. In this case the distance matrices are equal up to a positive scaling factor, and 
there is perfect association. 

Note also that when X and Y are centered vector-valued observations with X T Y = 0, 
and d x and d y are the Euclidean distance functions, GRV yields a value of 0. This is because 
XX T = XX T and YY T = YY T , so that tr (G x G y ) = ti(XX T YY T ) = tr (XX T YY T ). 
This result applies when the data is scalar-valued, so GRV equals inline with the raw data 
being uncorrelated. Thus GRV overcomes the limitation of Mantel. 
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On the other hand, when the distances functions are both semi-metric, the ordered eigen- 
values {A a ; i j}^ 1 and {A ?/i i}^ 1 are both non-negative and negative, so that GRV may attain 
negative values. Since the diagonal elements of G x and G y are both positive and negative, 
there may exist two such matrices with equal diagonals. Hence there may exist a scenario in 
which equality (0) is attained, and perfect association is indicated by a GRV value of 1. 



B.2 Metric and semi-metric distances 

When one of the distances, d x for instance, is metric while d y is semi-metric, then only the 
ordered eigenvalues {X X: i}fL 1 are strictly non-negative, so that the summation in the lower 
bound may be negative. In this case, GRV may attain negative values. The maximum 
attainable value of GRV is not 1 in this case, since equality ([5]) cannot be attained. This 
is because the diagonals of G x are positive while the diagonals of G y are both positive and 
negative. Thus perfect association cannot be attained, but larger values indicate greater 
association. 

This result is not surprising because the relationship between pairwise distances with 
respect to d x and d y cannot be the same. Recall that metric distance functions satisfy the 
triangle i nequality so that d x (xj,Xj) < d x (xi,X) i .) + d x (xk,Xj) for any three observations 

Xj, Xj , Xfc 



(|Mardia et all . ll979h . The corresponding distances with respect to d y will not share 



this property as d y is semi-metric, that is, d y (yi,yj) does not satisfy d y (yi,yj) < d y (yi,yk) + 
dyiyk-, Vj) for yt, yj,yk- Thus the inter-point relationships between all the distances in A. x will 
not match those in (for if they did d y would satisfy the triangle inequality and hence be 
metric). 



C Distance measures 
C.l SNPs 

Assume two P-dimensional vectors x = (x\, . . . ,xp) T and y = (y±, . . . ,yp) T with discrete- 
valued elements representing minor allele counts at P SNPs. The identity-by-state (IBS) 
distance measure is defined as 



1 p 

diBs{x, y) = 1 - — s{x p , y p ), 



2P 

where s(x p ,y p ) = if x p = and y p = 2, or if x p = 2 and y p = 0, s(x p ,y p ) = 1 if x p = 1 
and y p ^ 1, or if y p = 1 and x p ^ 1, and s(x p ,y p ) = 2 if x p = y p . This distance takes values 
between and 1 and is semi-metric. 

Genetic distances have also been proposed based on the contingency ta ble containing the 



frequ ency that each combination of minor allele counts occurs over the SNPs (jSelinski and Ickstadt 
200fJ >: see Table 



The key statistics in this table are the number of complete matches of the minor allele 
counts, m + = Y^k=o m kk, and the number of mismatches, m~ = P—m + , where the total num- 
ber of possible matches is P. Based on these quantities, the following 'matching coefficient' 
distance measures can be defined; the Simple Matching distance 

d S hi{x,y) = 1 - — , 



16 



Table 2: Contingency table containing the frequency of a given combination of minor allele count 
between x and y over the P SNPs. rriki is the frequency of x having k minor alleles and y having I 
minor alleles. 
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the Sokal and Sneath (SS) distance 

dss{x,y) = 1 - 

and the Rogers and Tanimoto I (RTI) distance 

d RT i(x,y) = 1 - 



m 



m + + ^m 



m 



m + + 2m 

There is also the Hamman I similarity measure 

m + — mT 
SHi{x,y) = - , 

which can be transformed into a distance measure as follows. Assume N P-dimensional 
minor allele count vectors {xi}f =l ; this is required in order to normalize the magnitude of the 
similarities to the range [0, 1]. The Hamman I distance between Xi and Xj is then given by 

d m {xi,Xj) = 1 s (x t Xj) 

m&x{s*{Xi, Xj) \ 

where s*(xi,Xj) = SHi(xi,Xj) + \min{sHi(xi,Xj)}\. This takes values between and 1 and 
is semi-metric. 

C.2 Vectors 

Assume two P-dimensional real- valued vectors x = (x-\, . . . ,xp) T and y = (yi, . . . , yp) T ■ 
Many measures exist (see, for example, Pekalska and Duin ( 20051 )). of which a few are pro- 



vided in Table El along with their ranges and properties, i.e., whether they are metric or 
semi-metric. These include the Euclidean, Manhattan, Maximum, Bray-Curtis, Pearson's 
correlation and the Cosine angle distances. 

Spearman's correlation distance is defined by applying Pearson's correlation to the ranks 
of the elements of the vectors, rather than the actual values. In particular, let x r = 
) T and y r = (y r \, ... , y r p) T be the vectors containing the ranks of the elements 
of x and y, respectively in ascending order (highest value given rank 1). That IS -j Xyp IS the 
rank of x p , and similarly for y rp . If several elements of a given vector are equal, they are 
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Table 3: Commonly encountered distance measures for vector-valued objects. The (M) or (SM) by 
each distance name indicates whether it is metric or semi-metric. 



Distance 



Notation 



Definition 



Range 



Euclidean (M) 
Manhattan (SM) 
Maximum (SM) 

Bray-Curtis (SM) 
Mahalanobis (SM) 



Pearson's 
correlation (SM) 

Cosine 
angle (SM) 



dE(x,y) 
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d PC (x,y) 
dpc(x,y) 



x - y) T (x - y) 

E P =i \ x p ~ Vp\ 

max{|x p - y p \} 



E P =i \ x p Vp\ 

EJ=i {x P + y P ) 



(x-y) T S-Hx-y), 
SaPxP covariance matrix, P < N 



Ep=i ( x p ~ x ) (Vp - v) 



E 



x ) 2 Ep=i (v P - y) 

p l^p=\ Vp 



p=l \ x p 



p i^p=\ x pi y 
1 



2 ' 



T 

x L y 



\x\\ \\y\ 



X 



E 



p=i 



\y\ 



E P =i Vp 



[0,oo) 
[0,oo) 
[0,oo) 

[0,oo) 
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assigned a rank equal to the mean of their respective positions in the list of ascending values. 
For example, for vector (0.1, 0.4, 0.4, 0.5, — 31) T , their respective positions are (4, 3, 2, 1, 5) T 
or (4, 2, 3, 1, 5) T , so that the ranks are given by (4, 2.5, 2.5, 1, 5) T . Spearman's correlation 
distance between x and y is thus given by 

dsc{x, y) = d PC (x r ,y r ), 

which ranges between and 2 and is semi-metric. 

The normalized mutual information (NMI) distance is defined as follows. The P elements 
of x and y are considered to be observations of the random variables x and y, respectively. 
Let Px(-) denote the PMF of x found by considering the histogram of the elements of x with 
M bins. That is, p x {i) gives the proportion of the elements {x p }^ =1 in the i th bin. We follow 



Priness et ali (120071 ) and use the integer value of yP as M. Estimate the entropy of x by 



M 



E(x) = Px{m)\og (p x (m)) , 



m=l 



and similarly for y with PMF p y (-)- Estimate the joint entropy of x and y by considering 
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the joint PMF, denoted {p xy (i, j)}f; = i, using a two-dimensional histogram of x and y, as 



M M 



E(x,y) = - ^2 ^^J/( m ' n ) lo s(^s/( m ' n )) 



m=l n=l 

The NMI distance measure is then given by 

d NM i(x,y) = 1 



E(x)+E(y)-E(x,y) 



max {E(x), E(y)} 

where the fraction is the NMI ( Michaels et al. . 19981 ). This distance is bounded by and 1 
and is semi-metric. 



D Simulation results 

D.l Comparison between GRV and Mantel 

Since the GRV and Mantel statistics can both be expressed as correlation coefficients, the 
difference in performance is due to the difference in standardized distance elements used as 
inputs in each case; a classical standardization is applied to the upper-triangular distance 
elements by Mantel, and a normalized double-centering is applied to the complete distance 
matrix by GRV. Figure [2] demonstrates that, for the data simulated as described in Section 3 
of the article with N = 50 and with the IBS and Mahalanobis distances, the standardization 
used by GRV yields a stronger correlation than the standardization used by Mantel. This is 
an example of how the standardization used by GRV better detects the association inherent 
in the simulated data. 




Figure 2: Standardized and normalized double-centered distances (gray points) used by the Mantel 
and GRV statistics, for the IBS and Mahalanobis distances for N = 50. In each case the gradient of 
the black line equals the correlation and hence the statistic value. The standardization used by GRV 
yields a higher correlation than achieved by the standardization used by Mantel. 
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Table 4: Mean (and standard deviation) of the proportion of null p-values less than or equal to the 
given significance levels for N — {30, 50, 70}. 









Significance level (%) 






N 


1 




5 


10 




30 


0.010 


(0.007) 


0.051 (0.014) 


0.100 


(0.020) 


50 


0.010 


(0.006) 


0.050 (0.016) 


0.101 


(0.020) 


70 


0.011 


(0.006) 


0.051 (0.017) 


0.102 


(0.021) 



(a) IBS and Mah (b) SS and PC (b) RTI and NMI 




0.090 0.095 0.100 0.105 0.015 0.020 0.025 0.030 0.090 0.095 0.100 0.105 

GRV GRV GRV 



Figure 3: Sampling distributions of the GRV statistic obtained using 10 6 Monte Carlo permutations 
and the proposed approximate PDF applied to a pathway of the real data including a SNP-set com- 
prised of 2092 SNPs and a transcript-set comprised of 51 transcripts, (a) The IBS distance is applied 
to the SNP data and the Mahalanobis (Mah) distance is applied to the gene expression data, (b) The 
SS distance is applied to the SNP data and the Pearson's correlation distance (PC) distance is applied 
to the gene expression data, (b) The RTI distance is applied to the SNP data and the normalized 
mutual information (NMI) distance is applied to the gene expression data. 

D.2 Attainment of nominal significance level of GRV 

We demonstrate that the GRV test attains the nominal significance level of a given test for 
a range of significance levels by using a Monte Carlo procedure. We generate paired data 
inspired by the eQTL application setting used in the power study. 

For 100 Monte Carlo runs, 200 paired datasets are generated as described in Section 3 of 
the article for P = 2, Q = 10, and N = {30,50,70}, except with yi = ej so that yi is not 
dependent on aij, and the IBS and Mahalanobis distance measures applied. Over all runs the 
mean and standard deviation of the proportion of p-values less than or equal to significance 
levels of 1%,5%, 10% are monitored. These are presented in Table HI where it is clear that 
the nominal significance levels are attained. 

E Illustration of the null distribution of GRV for real data 

In order to illustrate how the null distribution compares with the permutation distribution 
when applied to real data, we consider a subset of the cancer data. For a SNP-set comprised 
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of 2092 SNPs and a transcript-set comprised of 51 transcripts, each of the following combi- 
nations of SNP and gene expression distances were applied: (a) IBS and Mahalanobis, (b) 
SS and Pearson's correlation, and (c) RTI and normalized mutual information. For each we 
obtained the approximate null distribution of the GRV statistic and the permutation distri- 
bution using 10 6 Monte Carlo permutations. These are given in Figure where we see that 
the approximate distribution appears to provide a good fit to the permutation distributions. 

References 

Agarwal, R. and Kaye, S. B. (2003). Ovarian cancer: strategies for overcoming resistance to 
chemotherapy. Nature Reviews Cancer, 3, 502-516. 

Beckmann, L. et al. (2005). Haplotype Sharing Analysis Using Mantel Statistics. Human 
Heredity, 59(2), 67-78. 

Beroukhim, R. et al. (2007). Assessing the significance of chromosomal aberrations in cancer: 
Methodology and application to glioma. PNAS, 104(50), 20007-20012. 

Berry, K. J. and Mielke, P. W. (1983). Moment approximations as an alternative to the F 
test in analysis of variance. British Journal of Mathematical and Statistical Psychology, 
36(2), 202-206. 

Borg, I. and Groenen, P. J. F. (2005). Modern multidimensional scaling: theory and applica- 
tions. Springer Science+Business Media, Inc., second edition. 

Carvalho, B. et al. (2007). Exploration, normalization, and genotype calls of high-density 
oligonucleotide SNP array data. Biostatistics , 8(2), 485-499. 

Christopherson, R. I. et al. (2002). Inhibitors of de novo nucleotide biosynthesis as drugs. 
Accounts of Chemical Research, 35(11), 961-971. 

Cookson, W. et al. (2009). Mapping complex disease traits with global gene expression. 
Nature Reviews Genetics, 10(3), 184-194. 

Dai, W. et al. (2011). Systematic CpG Islands Methylation Profiling of Genes in the Wnt 
Pathway in Epithelial Ovarian Cancer Identifies Biomarkers of Progression-Free Survival. 
Clinical Cancer Research, 17, 4052. 

Dervieux, T. et al. (2004). Polyglutamation of methotrexate with common polymorphisms 
in reduced folate carrier, aminoimidazole carboxamide ribonucleotide transformylase, and 
thymidylate synthase are associated with methotrexate effects in rheumatoid arthritis. 
Arthritis & Rheumatism, 50(9), 2766-2774. 

Escoufier, Y. (1973). Le traitement des variables vectorielles. Biometrics, pages 751-760. 

Firestine, S. M. and Davisson, V. J. (1993). A tight binding inhibitor of 5-Aminoimidazole 
Ribonucleotide Carboxylase. Journal of Medical Chemistry, 36, 3484-3486. 

Fong, P. C. et al. (2010). Poly (ADP)-ribose polymerase inhibition: frequent durable responses 
in BRCA carrier ovarian cancer correlating with platinum-free interval. Journal of Clinical 
Oncology, 28(15), 2512-2519. 



21 



Gandy, A. and Rubin-Delanchy, P. (2013). An algorithm to compute the power of monte 
carlo tests with guaranteed precision. The Annals of Applied Statistics, 41(1), 125-142. 

Goh, L. et al. (2011). Assessing matched normal and tumor pairs in next-generation sequenc- 
ing studies. PloS one, 6(3), el7810. 

Gower, J. C. (1966). Some distance properties of latent root and vector methods used in 
multivariate analysis. Biometrika, 53(3-4), 325-338. 

Hickey, K. P. o. (1999). Molecular detection of tumour DNA in serum and peritoneal fluid 
from ovarian cancer patients. British Journal of Cancer, 80(11), 1803-1808. 

Hollich, V. et al. (2005). Assessment of protein distance measures and tree-building methods 
for phylogenetic tree reconstruction. Molecular Biology and Evolution, 22(11), 2257-2264. 

Irizarry, R. et al. (2003). Exploration, normalization, and summaries of high density oligonu- 
cleotide array probe level data. Bio statistics, 4(2), 249-264. 

Josse, J. et al. (2008). Testing the significance of the RV coefficient. Computational Statistics 
and Data Analysis, 53(1), 82-91. 

Jurman, G. et al. (2008). Algebraic stability indicators for ranked lists in molecular profiling. 
Bioinformatics, 24(2), 258-264. 

Kamburov, A. et al. (2011). ConsensusPathDB: toward a more complete picture of cell 
biology. Nucleic Acids Research, 39(S1), D712-D717. 

Kazi-Aoual, F. et al. (1995). Refined approximations to permutation tests for multivariate 
inference. Computational statistics and data analysis, 20(6), 643-656. 

Knijnenburg, T. A. et al. (2009). Fewer permutations, more accurate P-values. Bioinformat- 
ics, 25(12), il61-il68. 

Krzanowski, W. J. (2000). Principles of multivariate analysis: a user's perspective. Oxford 
University Press. 

Lasserre, J. (1995). A trace inequality for matrix product. IEEE Transactions on Automatic 
Control, 40(8), 1500-1501. 

Legendre, P. and Fortin, M. J. (2010). Comparison of the Mantel Test and Alternative 
Approaches for Detecting Complex Multivariate Relationships in the Spatial Analysis of 
Genetic Data. Molecular Ecology Resources, 10(5), 831-844. 

Louhimo, R. and Hautaniemi, S. (2011). CNAmet: an R package for integrating copy number, 
methylation and expression data. Bioinformatics, 27(6), 887-888. 

Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Ap- 
proach. Cancer Research, 27(2), 209-220. 

Mardia, K. et al. (1979). Multivariate analysis. Academic Press. 

Michaels, G. et al. (1998). Cluster analysis and data visualization of large-scale gene expression 
data. In Pacific Symposium on Biocomputing , volume 3, pages 42-53. 



22 



Mielke, P. W. and Berry, K. J. (2007). Permutation methods: a distance function approach. 
Springer Science+Business Media, LLC. 

Minas, C. et al. (2011). Distance-based differential analysis of gene curves. Bioinformatics , 
27(22), 3135-3141. 

Pekalska, E. and Duin, R. P. W. (2005). The dissimilarity representation for pattern recogni- 
tion: foundations and applications. World Scientific Co. Pte. Ltd. 

Peres-Neto, P. R. and Jackson, D. A. (2001). How well do multivariate data sets match? The 
advantages of a Procrustean superimposition approach over the Mantel test. Oecologia, 
129(2), 169-178. 

Phipson, B. and Smyth, G. K. (2010). Permutation p- values should never be zero: calculating 
exact p- values when permutations are randomly drawn. Statistical Applications in Genetics 
and Molecular Biology, 9, 39. 

Priness, I. et al. (2007). Evaluation of gene-expression clustering via mutual information 
distance measure. BMC Bioinformatics, 8(1), 111. 

Sawyers, C. L. (2008). The cancer biomarker problem. Nature, 452, 548-552. 

Selinski, S. and Ickstadt, K. (2005). Similarity Measures for Clustering SNP Data. Technical 
Report / Universitt Dortmund, SFB 475 Komplexittsreduktion in Multivariaten Datenstruk- 
turen. 

Shannon, W. D. et al. (2002). Mantel statistics to correlate gene expression levels from 
microarrays with clinical covariates. Genetic Epidemiology, 23(1), 87-96. 

Song, L. et al. (2012). Comparison of co-expression measures: mutual information, correlation, 
and model based indices. BMC Bioinformatics, 13(1), 328. 

Stronach, E. A. et al. (2011). HDAC4-Regulated STAT1 Activation Mediates Platinum Re- 
sistance in Ovarian Cancer. Cancer Research, 71, 4412. 

Sun, Y. V. et al. (2011). Identification of genes associated with complex traits by testing the 
genetic dissimilarity between individuals. In BMC Proceedings, volume 5, page S120. Sun 
et al; licensee BioMed Central Ltd. 

Torgerson, W. S. (1952). Multidimensional Scaling: I. Theory and Method. Psychometrika, 
17(4), 401-419. 

van't Veer, L. J. and Bernards, R. (2008). Enabling personalized cancer medicine through 
analysis of gene-expression patterns. Nature, 452, 564-570. 

Vaughan, S. et al. (2011). Rethinking ovarian cancer: recommendations for improving out- 
comes. Nature Reviews Cancer, 11, 719-725. 

Wu, T. J. et al. (1997). A measure of DNA sequence dissimilarity based on Mahalanobis 
distance between frequencies of words. Biometrics, 53, 1431-1439. 



23 



